

AXXXXXXXXXXXXXXXXXXXXXX
YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
Step 1 Load the packages we will use (don’t forget to install them first!).
library(readtext)
library(tidyverse)
library(quanteda)
library(rvest)
library(quanteda.textstats)
library(quanteda.textplots)
library(stringr)
library(spacyr)
library(xml2)
library(ggsci)
If you don’t want to go through the scraping, you can skip to section “3. Alternatively… import a CSV file!”.
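If some of these packages are missing, a one-time setup along these lines works (a sketch; the vector simply lists the CRAN package names loaded above):

```r
# Install any of the packages used in this tutorial that are not yet available.
pkgs <- c("readtext", "tidyverse", "quanteda", "rvest",
          "quanteda.textstats", "quanteda.textplots",
          "stringr", "spacyr", "xml2", "ggsci")
missing <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing) > 0) install.packages(missing)
```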
ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
Step 2.1 Scrape XXXXXXXXXX to get the episode transcripts.
datalist_himym <- readtext::readtext("texts/how-i-met-your-mother/*.txt")
paste("CHECK WITH JORGE WHERE DID HE TAKE THE EPISODES TRANSCRIPTS")
## [1] "CHECK WITH JORGE WHERE DID HE TAKE THE EPISODES TRANSCRIPTS"
Step 2.2 Scrape Wikipedia to get the episode and character metadata and clean it!
### CHECK WITH JORGE WHERE DID HE TAKE THE EPISODES TRANSCRIPTS
url_himym <- "https://en.wikipedia.org/wiki/List_of_How_I_Met_Your_Mother_episodes"
url_himym_characters <- "https://en.wikipedia.org/wiki/List_of_How_I_Met_Your_Mother_characters"
l_tables_himym <- url_himym %>%
read_html() %>%
html_nodes("table") %>%
html_table(fill = TRUE)
l_tables_himym
## [[1]]
## # A tibble: 10 × 6
## Season Episodes Episodes `Originally aired` Originally a…¹ Rank
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Season Episodes Episodes First aired Last aired Rank
## 2 1 22 22 September 19, 2005 (2005-09-19) May 15, 2006 … 54
## 3 2 22 22 September 18, 2006 (2006-09-18) May 14, 2007 … 61
## 4 3 20 20 September 24, 2007 (2007-09-24) May 19, 2008 … 70
## 5 4 24 24 September 22, 2008 (2008-09-22) May 18, 2009 … 49
## 6 5 24 24 September 21, 2009 (2009-09-21) May 24, 2010 … 42
## 7 6 24 24 September 20, 2010 (2010-09-20) May 16, 2011 … 48
## 8 7 24 24 September 19, 2011 (2011-09-19) May 14, 2012 … 45
## 9 8 24 24 September 24, 2012 (2012-09-24) May 13, 2013 … 42
## 10 9 24 24 September 23, 2013 (2013-09-23) March 31, 201… 28
## # … with abbreviated variable name ¹`Originally aired`
##
## [[2]]
## # A tibble: 22 × 8
## No.overall `No. inseason` Title Direc…¹ Writt…² Origi…³ Prod.…⁴ US vi…⁵
## <int> <int> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 1 "\"Pilot\"" Pamela… Carter… Septem… 1ALH79 10.94[…
## 2 2 2 "\"Purple … Pamela… Carter… Septem… 1ALH01 10.40[…
## 3 3 3 "\"Sweet T… Pamela… Phil L… Octobe… 1ALH02 10.44[…
## 4 4 4 "\"Return … Pamela… Kourtn… Octobe… 1ALH03 9.84[1…
## 5 5 5 "\"Okay Aw… Pamela… Chris … Octobe… 1ALH04 10.14[…
## 6 6 6 "\"Slutty … Pamela… Brenda… Octobe… 1ALH05 10.89[…
## 7 7 7 "\"Matchma… Pamela… Chris … Novemb… 1ALH07 10.55[…
## 8 8 8 "\"The Due… Pamela… Gloria… Novemb… 1ALH06 10.35[…
## 9 9 9 "\"Belly F… Pamela… Phil L… Novemb… 1ALH09 10.29[…
## 10 10 10 "\"The Pin… Pamela… Carter… Novemb… 1ALH08 12.27[…
## # … with 12 more rows, and abbreviated variable names ¹`Directed by`,
## # ²`Written by`, ³`Original air date`, ⁴Prod.code, ⁵`US viewers(millions)`
##
## [[3]]
## # A tibble: 22 × 8
## No.overall `No. inseason` Title Direc…¹ Writt…² Origi…³ Prod.…⁴ US vi…⁵
## <int> <int> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 23 1 "\"Where W… Pamela… Carter… Septem… 2ALH01 10.48[…
## 2 24 2 "\"The Sco… Rob Gr… Chris … Septem… 2ALH02 9.14[3…
## 3 25 3 "\"Brunch\… Pamela… Stephe… Octobe… 2ALH03 9.32[3…
## 4 26 4 "\"Ted Mos… Pamela… Kristi… Octobe… 2ALH04 9.09[3…
## 5 27 5 "\"World's… Pamela… Brenda… Octobe… 2ALH06 9.05[3…
## 6 28 6 "\"Aldrin … Pamela… Jamie … Octobe… 2ALH05 9.59[3…
## 7 29 7 "\"Swarley… Pamela… Greg M… Novemb… 2ALH07 8.22[3…
## 8 30 8 "\"Atlanti… Pamela… Maria … Novemb… 2ALH08 9.33[3…
## 9 31 9 "\"Slap Be… Pamela… Kourtn… Novemb… 2ALH09 8.85[3…
## 10 32 10 "\"Single … Pamela… Kristi… Novemb… 2ALH10 9.85[3…
## # … with 12 more rows, and abbreviated variable names ¹`Directed by`,
## # ²`Written by`, ³`Original air date`, ⁴Prod.code, ⁵`US viewers(millions)`
##
## [[4]]
## # A tibble: 20 × 8
## No.overall `No. inseason` Title Direc…¹ Writt…² Origi…³ Prod.…⁴ US vi…⁵
## <int> <int> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 45 1 "\"Wait fo… Pamela… Carter… Septem… 3ALH01 8.12[4…
## 2 46 2 "\"We're N… Pamela… Chris … Octobe… 3ALH02 7.88[5…
## 3 47 3 "\"Third W… Pamela… David … Octobe… 3ALH04 7.96[5…
## 4 48 4 "\"Little … Rob Gr… Kourtn… Octobe… 3ALH03 7.71[5…
## 5 49 5 "\"How I M… Pamela… Gloria… Octobe… 3ALH05 8.50[5…
## 6 50 6 "\"I'm Not… Pamela… Jonath… Octobe… 3ALH07 8.55
## 7 51 7 "\"Dowiset… Pamela… Brenda… Novemb… 3ALH06 8.77[5…
## 8 52 8 "\"Spoiler… Pamela… Stephe… Novemb… 3ALH08 8.58[5…
## 9 53 9 "\"Slapsgi… Pamela… Matt K… Novemb… 3ALH09 8.06[5…
## 10 54 10 "\"The Yip… Pamela… Jamie … Novemb… 3ALH10 7.91[5…
## 11 55 11 "\"The Pla… Pamela… Carter… Decemb… 3ALH11 8.49
## 12 56 12 "\"No Tomo… Pamela… Carter… March … 3ALH12 9.73
## 13 57 13 "\"Ten Ses… Pamela… Chris … March … 3ALH14 10.67[…
## 14 58 14 "\"The Bra… Pamela… Joe Ke… March … 3ALH13 9.50[5…
## 15 59 15 "\"The Cha… Pamela… Carter… April … 3ALH15 7.99[6…
## 16 60 16 "\"Sandcas… Pamela… Kourtn… April … 3ALH16 8.45[6…
## 17 61 17 "\"The Goa… Pamela… Stephe… April … 3ALH17 8.84[6…
## 18 62 18 "\"Rebound… Pamela… Jamie … May 5,… 3ALH18 8.36[6…
## 19 63 19 "\"Everyth… Pamela… Jonath… May 12… 3ALH19 8.93[6…
## 20 64 20 "\"Miracle… Pamela… Carter… May 19… 3ALH20 7.99[6…
## # … with abbreviated variable names ¹`Directed by`, ²`Written by`,
## # ³`Original air date`, ⁴Prod.code, ⁵`US viewers(millions)`
##
## [[5]]
## # A tibble: 24 × 8
## No.overall `No. inseason` Title Direc…¹ Writt…² Origi…³ Prod.…⁴ US vi…⁵
## <int> <int> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 65 1 "\"Do I Kn… Pamela… Carter… Septem… 4ALH01 9.79[6…
## 2 66 2 "\"The Bes… Pamela… Carter… Septem… 4ALH02 8.72[6…
## 3 67 3 "\"I Heart… Pamela… Greg M… Octobe… 4ALH04 8.97[6…
## 4 68 4 "\"Interve… Michae… Stephe… Octobe… 4ALH03 9.25[6…
## 5 69 5 "\"Shelter… Pamela… Chris … Octobe… 4ALH05 9.45[7…
## 6 70 6 "\"Happily… Pamela… Jamie … Novemb… 4ALH06 9.40[7…
## 7 71 7 "\"Not a F… Pamela… Robia … Novemb… 4ALH07 9.79[7…
## 8 72 8 "\"Woooo!\… Pamela… Carter… Novemb… 4ALH09 9.99[7…
## 9 73 9 "\"The Nak… Pamela… Joe Ke… Novemb… 4ALH08 10.04[…
## 10 74 10 "\"The Fig… Pamela… Theres… Decemb… 4ALH10 10.49[…
## # … with 14 more rows, and abbreviated variable names ¹`Directed by`,
## # ²`Written by`, ³`Original air date`, ⁴Prod.code, ⁵`US viewers(millions)`
##
## [[6]]
## # A tibble: 24 × 8
## No.overall `No. inseason` Title Direc…¹ Writt…² Origi…³ Prod.…⁴ US vi…⁵
## <int> <int> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 89 1 "\"Definit… Pamela… Carter… Septem… 5ALH01 9.09[9…
## 2 90 2 "\"Double … Pamela… Matt K… Septem… 5ALH02 8.73[9…
## 3 91 3 "\"Robin 1… Pamela… Carter… Octobe… 5ALH03 8.23[9…
## 4 92 4 "\"The Sex… Pamela… Kourtn… Octobe… 5ALH04 8.56[9…
## 5 93 5 "\"Duel Ci… Pamela… Chuck … Octobe… 5ALH05 8.07[9…
## 6 94 6 "\"Bagpipe… Pamela… Robia … Novemb… 5ALH06 8.82[9…
## 7 95 7 "\"The Rou… Pamela… Chris … Novemb… 5ALH07 8.82[9…
## 8 96 8 "\"The Pla… Pamela… Carter… Novemb… 5ALH08 8.44[9…
## 9 97 9 "\"Slapsgi… Pamela… Jamie … Novemb… 5ALH09 8.75[9…
## 10 98 10 "\"The Win… Pamela… Joe Ke… Decemb… 5ALH11 8.79[9…
## # … with 14 more rows, and abbreviated variable names ¹`Directed by`,
## # ²`Written by`, ³`Original air date`, ⁴Prod.code, ⁵`US viewers(millions)`
##
## [[7]]
## # A tibble: 24 × 8
## No.overall `No. inseason` Title Direc…¹ Writt…² Origi…³ Prod.…⁴ US vi…⁵
## <int> <int> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 113 1 "\"Big Day… Pamela… Carter… Septem… 6ALH01 8.79[1…
## 2 114 2 "\"Cleanin… Pamela… Stephe… Septem… 6ALH02 9.00[1…
## 3 115 3 "\"Unfinis… Pamela… Jamie … Octobe… 6ALH03 8.60[1…
## 4 116 4 "\"Subway … Pamela… Chris … Octobe… 6ALH04 8.48[1…
## 5 117 5 "\"Archite… Pamela… Carter… Octobe… 6ALH05 8.05[1…
## 6 118 6 "\"Baby Ta… Pamela… Joe Ke… Octobe… 6ALH07 8.29[1…
## 7 119 7 "\"Canning… Pamela… Chuck … Novemb… 6ALH06 8.88[1…
## 8 120 8 "\"Natural… Pamela… Carter… Novemb… 6ALH09 8.87[1…
## 9 121 9 "\"Glitter… Pamela… Kourtn… Novemb… 6ALH08 8.87[1…
## 10 122 10 "\"Blitzgi… Pamela… Theres… Novemb… 6ALH10 8.73[1…
## # … with 14 more rows, and abbreviated variable names ¹`Directed by`,
## # ²`Written by`, ³`Original air date`, ⁴Prod.code, ⁵`US viewers(millions)`
##
## [[8]]
## # A tibble: 23 × 8
## No.overall `No. inseason` Title Direc…¹ Writt…² Origi…³ Prod.…⁴ US vi…⁵
## <int> <int> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 137 1 "\"The Bes… Pamela… Carter… Septem… 7ALH01 11.00[…
## 2 138 2 "\"The Nak… Pamela… Stephe… Septem… 7ALH02 12.22[…
## 3 139 3 "\"Ducky T… Rob Gr… Carter… Septem… 7ALH03 10.50[…
## 4 140 4 "\"The Sti… Pamela… Kourtn… Octobe… 7ALH04 10.39[…
## 5 141 5 "\"Field T… Pamela… Jamie … Octobe… 7ALH06 8.89[1…
## 6 142 6 "\"Mystery… Pamela… Chuck … Octobe… 7ALH05 9.81[1…
## 7 143 7 "\"Noretta… Pamela… Matt K… Octobe… 7ALH07 9.87[1…
## 8 144 8 "\"The Slu… Pamela… Tami S… Octobe… 7ALH08 10.49[…
## 9 145 9 "\"Disaste… Michae… Robia … Novemb… 7ALH09 10.28[…
## 10 146 10 "\"Tick Ti… Pamela… Chris … Novemb… 7ALH10 10.42[…
## # … with 13 more rows, and abbreviated variable names ¹`Directed by`,
## # ²`Written by`, ³`Original air date`, ⁴Prod.code, ⁵`US viewers(millions)`
##
## [[9]]
## # A tibble: 23 × 8
## No.overall `No. inseason` Title Direc…¹ Writt…² Origi…³ Prod.…⁴ US vi…⁵
## <int> <int> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 161 1 "\"Farhamp… Pamela… Carter… Septem… 8ALH01 8.84[1…
## 2 162 2 "\"The Pre… Pamela… Carter… Octobe… 8ALH02 8.17[1…
## 3 163 3 "\"Nannies… Pamela… Chuck … Octobe… 8ALH03 7.82[1…
## 4 164 4 "\"Who Wan… Pamela… Matt K… Octobe… 8ALH04 7.93[1…
## 5 165 5 "\"The Aut… Pamela… Kourtn… Novemb… 8ALH06 7.22[1…
## 6 166 6 "\"Splitsv… Pamela… Stephe… Novemb… 8ALH05 7.95[1…
## 7 167 7 "\"The Sta… Pamela… Tami S… Novemb… 8ALH07 7.45[1…
## 8 168 8 "\"Twelve … Pamela… Eric F… Novemb… 8ALH08 8.73[1…
## 9 169 9 "\"Lobster… Pamela… Barbar… Decemb… 8ALH09 8.26[1…
## 10 170 10 "\"The Ove… Pamela… Craig … Decemb… 8ALH10 8.82[1…
## # … with 13 more rows, and abbreviated variable names ¹`Directed by`,
## # ²`Written by`, ³`Original air date`, ⁴Prod.code, ⁵`US viewers(millions)`
##
## [[10]]
## # A tibble: 23 × 8
## No.overall `No. inseason` Title Direc…¹ Writt…² Origi…³ Prod.…⁴ US vi…⁵
## <int> <int> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 185 1 "\"The Loc… Pamela… Carter… Septem… 9ALH01 9.40[1…
## 2 186 2 "\"Coming … Pamela… Carter… Septem… 9ALH02 9.40[1…
## 3 187 3 "\"Last Ti… Pamela… Craig … Septem… 9ALH04 7.87[1…
## 4 188 4 "\"The Bro… Pamela… Matt K… Octobe… 9ALH03 7.53[1…
## 5 189 5 "\"The Pok… Pamela… Dan Gr… Octobe… 9ALH05 7.98[1…
## 6 190 6 "\"Knight … Pamela… Chris … Octobe… 9ALH07 7.64[1…
## 7 191 7 "\"No Ques… Pamela… Stephe… Octobe… 9ALH06 7.63[1…
## 8 192 8 "\"The Lig… Pamela… Rachel… Novemb… 9ALH08 8.67[1…
## 9 193 9 "\"Platoni… Pamela… George… Novemb… 9ALH10 8.08[1…
## 10 194 10 "\"Mom and… Pamela… Carter… Novemb… 9ALH09 8.11[1…
## # … with 13 more rows, and abbreviated variable names ¹`Directed by`,
## # ²`Written by`, ³`Original air date`, ⁴Prod.code, ⁵`US viewers(millions)`
##
## [[11]]
## # A tibble: 10 × 26
## Season Season Episo…¹ Episo…² Episo…³ Episo…⁴ Episo…⁵ Episo…⁶ Episo…⁷ Episo…⁸
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 "Seas… Season 1 2 3 4 5 6 7 8
## 2 "" 1 10.9 10.4 10.4 9.84 10.1 10.9 10.6 10.4
## 3 "" 2 10.5 9.14 9.32 9.09 9.05 9.59 8.22 9.33
## 4 "" 3 8.12 7.88 7.96 7.71 8.5 8.55 8.77 8.58
## 5 "" 4 9.79 8.72 8.97 9.25 9.45 9.4 9.79 9.99
## 6 "" 5 9.09 8.73 8.23 8.56 8.07 8.82 8.82 8.44
## 7 "" 6 8.79 9 8.6 8.48 8.05 8.29 8.88 8.87
## 8 "" 7 11 12.2 10.5 10.4 8.89 9.81 9.87 10.5
## 9 "" 8 8.84 8.17 7.82 7.93 7.22 7.95 7.45 8.73
## 10 "" 9 9.4 9.4 7.87 7.53 7.98 7.64 7.63 8.67
## # … with 16 more variables: `Episode number` <dbl>, `Episode number` <dbl>,
## # `Episode number` <dbl>, `Episode number` <dbl>, `Episode number` <dbl>,
## # `Episode number` <dbl>, `Episode number` <dbl>, `Episode number` <dbl>,
## # `Episode number` <dbl>, `Episode number` <dbl>, `Episode number` <dbl>,
## # `Episode number` <dbl>, `Episode number` <dbl>, `Episode number` <dbl>,
## # `Episode number` <dbl>, `Episode number` <dbl>, and abbreviated variable
## # names ¹`Episode number`, ²`Episode number`, ³`Episode number`, …
##
## [[12]]
## # A tibble: 4 × 2
## .mw-parser-output .navbar{display:inline;font-size:88%;font-weight:n…¹ .mw-p…²
## <chr> <chr>
## 1 Episodes "Seaso…
## 2 Characters "Ted M…
## 3 Soundtracks "\"Let…
## 4 Related articles "Award…
## # … with abbreviated variable names
## # ¹`.mw-parser-output .navbar{display:inline;font-size:88%;font-weight:normal}.mw-parser-output .navbar-collapse{float:left;text-align:left}.mw-parser-output .navbar-boxtext{word-spacing:0}.mw-parser-output .navbar ul{display:inline-block;white-space:nowrap;line-height:inherit}.mw-parser-output .navbar-brackets::before{margin-right:-0.125em;content:"[ "}.mw-parser-output .navbar-brackets::after{margin-left:-0.125em;content:" ]"}.mw-parser-output .navbar li{word-spacing:-0.125em}.mw-parser-output .navbar a>span,.mw-parser-output .navbar a>abbr{text-decoration:inherit}.mw-parser-output .navbar-mini abbr{font-variant:small-caps;border-bottom:none;text-decoration:none;cursor:inherit}.mw-parser-output .navbar-ct-full{font-size:114%;margin:0 7em}.mw-parser-output .navbar-ct-mini{font-size:114%;margin:0 4em}vteHow I Met Your Mother`,
## # ²`.mw-parser-output .navbar{display:inline;font-size:88%;font-weight:normal}.mw-parser-output .navbar-collapse{float:left;text-align:left}.mw-parser-output .navbar-boxtext{word-spacing:0}.mw-parser-output .navbar ul{display:inline-block;white-space:nowrap;line-height:inherit}.mw-parser-output .navbar-brackets::before{margin-right:-0.125em;content:"[ "}.mw-parser-output .navbar-brackets::after{margin-left:-0.125em;content:" ]"}.mw-parser-output .navbar li{word-spacing:-0.125em}.mw-parser-output .navbar a>span,.mw-parser-output .navbar a>abbr{text-decoration:inherit}.mw-parser-output .navbar-mini abbr{font-variant:small-caps;border-bottom:none;text-decoration:none;cursor:inherit}.mw-parser-output .navbar-ct-full{font-size:114%;margin:0 7em}.mw-parser-output .navbar-ct-mini{font-size:114%;margin:0 4em}vteHow I Met Your Mother`
##
## [[13]]
## # A tibble: 11 × 2
## `vteHow I Met Your Mother episodes` `vteHow I Met Your Mother episodes`
## <chr> <chr>
## 1 "Seasons: 1\n2\n3\n4\n5\n6\n7\n8\n9" "Seasons: 1\n2\n3\n4\n5\n6\n7\n8\n9"
## 2 "Season 1" "\"Pilot\"\n\"Purple Giraffe\"\n\"Belly…
## 3 "Season 2" "\"Where Were We?\"\n\"Ted Mosby: Archi…
## 4 "Season 3" "\"Wait for It\"\n\"We're Not from Here…
## 5 "Season 4" "\"Do I Know You?\"\n\"The Best Burger …
## 6 "Season 5" "\"Definitions\"\n\"Double Date\"\n\"Ro…
## 7 "Season 6" "\"Big Days\"\n\"Cleaning House\"\n\"Un…
## 8 "Season 7" "\"The Best Man\"\n\"The Naked Truth\"\…
## 9 "Season 8" "\"Farhampton\"\n\"The Pre-Nup\"\n\"Nan…
## 10 "Season 9" "\"The Locket\"\n\"Coming Back\"\n\"Las…
## 11 "Category" "Category"
#This generates a list with all the tables contained in the page. In our case,
#we want the tables from the second element to the 10th.
l_tables_himym <- l_tables_himym[c(2:10)]
#Data cleaning to obtain clean tables
#Reduce the list into one data frame, since all of the tables share the same structure
df_himym <- data.frame(Reduce(bind_rows, l_tables_himym))
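`Reduce(bind_rows, …)` folds the list of tables into one stacked data frame. The same pattern with base `rbind()` on toy data:

```r
# Reduce() folds a binary function over a list: rbind(rbind(l[[1]], l[[2]]), l[[3]])
l <- list(data.frame(x = 1:2), data.frame(x = 3:4), data.frame(x = 5))
stacked <- Reduce(rbind, l)
nrow(stacked)  # 5 rows: all three tables stacked
```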
#We do the same for the characters of HIMYM
l_tables_himym_characters <- url_himym_characters %>%
read_html() %>%
html_nodes("table") %>%
html_table(fill = TRUE)
df_characters <- as.data.frame(l_tables_himym_characters[[1]]) %>%
select(Character)
df_characters_w <- df_characters %>%
filter(!stringr::str_starts(Character, "Futu"),
!(Character %in% c("Character", "Main Characters", "Supporting Characters"))) %>%
mutate(name = str_extract(Character,"([^ ]+)"),
name = replace(name, name == "Dr.", "Sonya"))
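`str_extract(Character, "([^ ]+)")` grabs everything up to the first space, which is why the entry starting with “Dr.” needs the manual fix to “Sonya”. A base-R sketch of the same first-word idea (example names for illustration):

```r
# Take the first word of each character entry by deleting from the first space on.
chars <- c("Ted Mosby", "Barney Stinson", "Dr. Sonya")
sub(" .*$", "", chars)  # "Ted" "Barney" "Dr."  <- "Dr." is why the fix is needed
```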
df_himym_filt <- df_himym %>% filter(str_length(No.overall) < 4)
df_himym_filt_dupl <- df_himym %>% filter(str_length(No.overall) > 4)
df_himym_filt_dupl_1 <- df_himym_filt_dupl %>%
mutate(No.overall = as.numeric(replace(No.overall, str_length(No.overall) > 4, substr(No.overall, 1, 3))),
No..inseason = as.numeric(replace(No..inseason, str_length(No..inseason) > 3, substr(No..inseason, 1, 2))),
Prod.code = replace(Prod.code, str_length(Prod.code) > 3, substr(Prod.code, 1, 6)))
df_himym_filt_dupl_2 <- df_himym_filt_dupl %>%
mutate(No.overall = as.numeric(replace(No.overall, str_length(No.overall) > 4, substr(No.overall, 4, 6))),
No..inseason = as.numeric(replace(No..inseason, str_length(No..inseason) > 3, substr(No..inseason, 3, 4))),
Title = replace(Title, Title == "\"The Magician's Code\"", "\"The Magician's Code Part 2\""),
Title = replace(Title, Title == "\"The Final Page\"", "\"The Final Page Part 2\""),
Title = replace(Title, Title == "\"Last Forever\"" , "\"Last Forever Part 2\"" ),
Prod.code = replace(Prod.code, str_length(Prod.code) > 3, substr(Prod.code, 7, 12)))
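The two branches above split rows where Wikipedia fused a two-part episode’s numbers into one string. The `substr()` logic, on a made-up fused value:

```r
# A two-part episode arrives with both overall numbers glued together.
no <- "179180"                 # illustrative fused value
as.numeric(substr(no, 1, 3))   # 179 -> first part (characters 1-3)
as.numeric(substr(no, 4, 6))   # 180 -> second part (characters 4-6)
```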
df_himym_final <- bind_rows(df_himym_filt,
df_himym_filt_dupl_1,
df_himym_filt_dupl_2) %>%
arrange(No.overall, No..inseason) %>%
mutate(year = str_extract(Original.air.date, "[0-9]{4}"),
Season = as.numeric(stringr::str_extract(Prod.code, "^.{1}"))) %>%
rename(Chapter = No..inseason)
df_himym_final$US.viewers.millions. <- as.numeric(str_replace_all(df_himym_final$US.viewers.millions., "\\[[0-9]+\\]", ""))
datalist_himym <- readtext::readtext("texts/how-i-met-your-mother/*.txt")
v_season <- as.numeric(stringr::str_extract(datalist_himym$doc_id, "\\d+"))
v_chapter <- as.numeric(stringi::stri_extract_last_regex(datalist_himym$doc_id, "[0-9]+"))
datalist_himym_w <- datalist_himym %>% mutate(Season = v_season, Chapter = v_chapter)
df_himym_final_doc <- full_join(as.data.frame(datalist_himym_w), df_himym_final, by = c("Season", "Chapter")) %>%
mutate(Season_w = paste("Season", Season))
## Final dataframe to be used as a corpus text
df_himym_final_doc
#write.csv(df_himym_final_doc, file = "df_himym_final_doc.csv")
If you went through section 2, you already have the raw data and can jump to section “4. Uploading data into Quanteda corpus”.
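The first-number/last-number extraction used for `v_season` and `v_chapter` can be illustrated in base R on a hypothetical file name (the real doc_ids in texts/how-i-met-your-mother/ may be named differently):

```r
# First number in the doc_id -> season; last number -> chapter.
doc_id <- "season1-episode03.txt"  # hypothetical name
nums   <- regmatches(doc_id, gregexpr("[0-9]+", doc_id))[[1]]
as.numeric(nums[1])                # 1 -> season
as.numeric(nums[length(nums)])     # 3 -> chapter
```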
Step 3.1 Import the csv database
# We need to think about an offline solution
#input <- read.csv("XXXXXXX.csv", stringsAsFactors = F)
#df_himym_final_doc <-
OK! It’s showtime! Let’s upload all our data into a Quanteda corpus object.
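The offline route could look like this round trip (a sketch; the actual file name and columns depend on what you saved with `write.csv()` in section 2):

```r
# Save the final data frame once, then read it back instead of re-scraping.
# Shown here as a self-contained round trip on a temporary file.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(Season = 1, Chapter = 1), tmp, row.names = FALSE)
input <- read.csv(tmp, stringsAsFactors = FALSE)
input$Season  # 1
```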
Step 4.1 First Step: Define a corpus
corp_himym <- corpus(df_himym_final_doc) # build a new corpus from the texts
docnames(corp_himym) <- df_himym_final_doc$Title
summary(corp_himym, n = 15)
Step 4.2 Convert the corpus into tokens and wrangle them
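Before running it on the full corpus, here is the gist of this cleaning step on a toy line (a base-R sketch of stopword removal; quanteda additionally handles punctuation, numbers, symbols, and separators at scale):

```r
# Drop function words so that only content-bearing tokens remain.
line  <- c("have", "you", "met", "ted")
stops <- c("have", "you")       # stand-ins for stopwords("english")
line[!line %in% stops]          # "met" "ted"
```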
corp_himym_cleaned <- tokens(corp_himym, remove_punct = TRUE,
remove_separators = TRUE,
remove_numbers = TRUE,
remove_symbols = TRUE) %>%
tokens_remove(stopwords("english"))
Step 4.3 Create a DFM.
dfm_himym_cleaned <- corp_himym_cleaned %>% dfm()
tstat_simil <- textstat_simil(dfm_himym_cleaned)
tstat_dist <- textstat_dist(dfm_himym_cleaned)
Let’s check the new DFM
dfm_himym_cleaned
Now that we have uploaded the data and created the Quanteda objects, we can try some basic analyses.
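A document-feature matrix is, at heart, just documents by feature counts. In miniature, with base `table()` on two toy “episodes”:

```r
# Each cell counts how often a feature (word) appears in a document.
toks <- list(ep1 = c("ted", "robin", "ted"), ep2 = c("barney", "ted"))
tab  <- table(doc = rep(names(toks), lengths(toks)), feature = unlist(toks))
tab["ep1", "ted"]  # 2: "ted" occurs twice in ep1
```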
Step 5.1. Let’s choose a specific season
# Filter corpus and create a subset DFM.
Season_choice <- 1
corpus_subset <- corpus_subset(corp_himym, Season == Season_choice)
tokens_himym_cleaned <- tokens(corpus_subset, remove_punct = TRUE,
remove_separators = TRUE,
remove_numbers = TRUE,
remove_symbols = TRUE) %>%
tokens_remove(stopwords("english"))
dfm_himym_subsetted <- tokens_himym_cleaned %>% dfm()
Let’s check the new subsetted DFM
dfm_himym_subsetted
Step 5.2 Do you think we should watch this show in its original sequence? Let’s check the similarity between episodes!
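Before clustering real episodes, here is `hclust()` in miniature on three points, using the same `as.dist()` + `hclust()` pattern as below:

```r
# Three points on a line: 0 and 1 are close, 10 is far away.
d <- dist(c(0, 1, 10))    # pairwise Euclidean distances
h <- hclust(as.dist(d))   # agglomerative clustering on the distance object
h$merge                   # row 1 merges singletons 1 and 2 (the closest pair)
```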
### Similarity between episodes
tstat_simil <- textstat_simil(dfm_himym_subsetted) # similarities within the chosen season
clust <- hclust(as.dist(tstat_simil))
dclust <- as.dendrogram(clust)
dclust <- reorder(dclust, 1:22)
nodePar <- list(lab.cex = 1, pch = c(NA, 19),
cex.axis=1.5,
cex = 2, col = "#0080ff")
#Talk about different methods besides correlation
par(mar=c(15,7,2,1))
plot(dclust,nodePar = nodePar,
cex.lab=2, cex.axis=2, cex.main=2, cex.sub=2,
main="How I Met Your Mother Season 1",
type = "triangle",ylim=c(0,1),
ylab = "Similarity between episodes (correlation %)",
edgePar = list(col = 4:7, lwd = 7:7),
panel.first = abline(h=c(seq(.10, 1, .10)), col="grey80"))
rect.hclust(clust, k = 5, border = "#ffb1b1")
Step 5.3 Is there a correlation between the episodes?
## Distance between episodes (by correlation)
tstat_dist <- textstat_dist(dfm_himym_subsetted)
clust_dist <- hclust(as.dist(tstat_dist))
dclust_dist <- as.dendrogram(clust_dist)
dclust_dist <- reorder(dclust_dist, 22:1)
nodePar_2 <- list(lab.cex = 1.2, pch = c(NA, 19),
cex = 1.8, col = "#ffc733")
par(mar=c(15,7,2,1))
plot(dclust_dist, nodePar = nodePar_2,
cex.lab=2, cex.axis=2, cex.main=2, cex.sub=2,
main="How I Met Your Mother Season 1",
type = "triangle",ylim=c(0,120),
ylab = "Distance between episodes (Euclidean)",
edgePar = list(col = 18:19, lwd = 7:7),
panel.first =abline(h=c(seq(10, 120, 10)), col="grey80"))
rect.hclust(clust_dist, k = 5, border = "#d80000")
Step 5.4 Who is the main character for you? Does it depend on the season?
# Appearances of characters by season
## Characters by season
toks_himym <- tokens(corp_himym, remove_punct = TRUE,
remove_separators = TRUE,
remove_numbers = TRUE,
remove_symbols = TRUE) %>%
tokens_remove(stopwords("english"))
#Add additional words
df_actors <- toks_himym %>% tokens_select(c("Ted", "Marshall", "Lily", "Robin", "Barney", "Mother")) %>%
tokens_group(groups = Season) %>%
dfm()
df_final_actors <- as.data.frame(textstat_frequency(df_actors, groups = c(1:9)))
ggplot(df_final_actors, aes(x = group, y = frequency, group = feature, color = feature)) +
geom_line() +
geom_point() +
ylim(0, 580)
# Let's try a different approach!
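The per-season counts behind this plot boil down to a contingency table: selected character tokens cross-tabulated with their season. In miniature:

```r
# Count selected character mentions per season (toy version of
# tokens_select() + tokens_group() + textstat_frequency()).
toks   <- c("ted", "robin", "ted", "lily")
season <- c(1, 1, 2, 2)
table(season, toks)  # one row per season, one column per character
```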
## Wordcloud of principal characters that appear in HIMYM
toks_himym_characters <- tokens(corp_himym, remove_punct = TRUE,
remove_separators = TRUE,
remove_numbers = TRUE,
remove_symbols = TRUE) %>%
tokens_keep(c(unique(df_characters_w$name)))
dfm_general_characters <- toks_himym_characters %>%
dfm()
textplot_wordcloud(dfm_general_characters,
random_order = FALSE,
rotation = 0.25,
min_count = 1, #Minimum frequency
color = RColorBrewer::brewer.pal(4, "Dark2"))
Step 5.5 Who are the most important secondary characters that appear in the TV show?
## Wordcloud of secondary characters that appear in HIMYM
toks_himym_sec_characters <- tokens(corp_himym,
remove_punct = TRUE,
remove_separators = TRUE,
remove_numbers = TRUE,
remove_symbols = TRUE) %>%
tokens_keep(c(unique(df_characters_w$name))) %>%
tokens_remove(c("Ted", "Barney", "Lily", "Robin", "Marshall"))
dfm_general_sec_characters <- toks_himym_sec_characters %>%
dfm()
textplot_wordcloud(dfm_general_sec_characters,
random_order = FALSE,
rotation = 0.25,
#comparison = TRUE,
labelsize = 1.5,
min_count = 1, #Minimum frequency
color = RColorBrewer::brewer.pal(8, "Spectral"))
Step 1 XXXXXXXXXXXXXXXXXXXXXXXXXXX
#YYYYYYYYYYYYYYYYThWWWWWWWWWWWWWWWWWWWW
Step 1 XXXXXXXXXXXXXXXXXXXXXXXXXXX
#YYYYYYYYYYYYYYYYZZZZZZZZZZZZZZZ